In [1]:
from IPython.display import HTML
HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Out[1]:

The Context

Understanding my two-faced conundrum

I study two trades and live in two worlds, both of which I take seriously. As a 1-year data science student and a 20-year illustrator, I find both fields invigorating for different reasons, and I respect the dedication it takes to be considered truly good at either. But with recent AI developments making these two worlds collide, I often find myself at a crossroads.

This is a deductive argument that has played in my head many times during my machine learning classes:

  1. All artists hate AI.
  2. I am an artist.
  3. Therefore, I hate AI.

The logic is valid. But the argument isn't sound, because I don't hate AI (kind of. it's complicated). So which premise is untrue? The answer: the first one. Not all artists actually hate AI.

AI is great as a tool that assists in the complex workflows that unpredictably arise in a creative studio. In the past year, every software service has rushed to integrate AI into its tools, including creative ones: Canva, Figma, Photoshop, and Clip Studio Paint (which rolled its feature back due to severe user backlash because... artists hate AI! We'll get into that more later). These are huge developments that compress a whole day's worth of tedious work into mere minutes. And big art studios are starting to use them.

For a very recent example, the newest animated Spiderverse movie used AI with spectacular success:

In [2]:
display(HTML('<center><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Spiderverse had one of the largest teams of animators to *ever* work on an animated movie, so I can assure you that it didn’t steal anyone’s job. THIS is ethical use of AI. It’s not stealing from anyone, and it’s making the animators life easier by eliminating repetitive tasks. <a href="https://t.co/ObebDlMaP7">https://t.co/ObebDlMaP7</a></p>&mdash; JV (@javi_khoso) <a href="https://twitter.com/javi_khoso/status/1667463965532733440?ref_src=twsrc%5Etfw">June 10, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></center>'))

Spiderverse had one of the largest teams of animators to *ever* work on an animated movie, so I can assure you that it didn’t steal anyone’s job. THIS is ethical use of AI. It’s not stealing from anyone, and it’s making the animators life easier by eliminating repetitive tasks. https://t.co/ObebDlMaP7

— JV (@javi_khoso) June 10, 2023

Unfortunately, this positive sentiment is the exception rather than the rule among creative AI developments (just look at what that tweet is quote retweeting). In most cases, after a company announces an AI-integrated feature, artists choose to boycott it. So what's the deal?

If AI is so helpful, why do (most) artists hate it?

As someone whose friends are exclusively art hobbyists and professionals, I know there are many reasons why artists hate AI! But it boils down to one thing: disrespect. Art is often taken for granted because it is seen as a luxury rather than a necessity. It does not provide an immediate benefit like the food from a restaurant, the utility of an appliance, the wellness given by fitness, healthcare, and medicine, or the security given by loans and insurance. But art is necessary, in the intangible long run. It keeps us sane. It gives us a space to dwell on deeper concepts. It is a craft.

This is an appreciation that most people don't share (and I get it, some people don't buy into flowery language). And so to most people, there is nothing wrong with seeing art as a process worth optimizing rather than a craft worth enhancing.

Automating art is a complex issue, because in complex workflows, AI does help. But to the solo designer, the freelance illustrator, and the independent artist, automation is... kind of insulting! A big chunk of art's beauty comes from how it was made, not just what it ends up being. And people who come into this space looking to make things better without this appreciative mindset are the equivalent of someone whacking a hornet's nest.

Little misunderstandings are what start wars

Some developers are well-intentioned. Some artists are open-minded. Good developers with good hearts want to make things that help artists make their lives easier. Open artists with open hearts are willing to try new things that will make their lives better. But the world is not full of these people. The world runs on capitalism. And in capitalism, even the most well-meaning people have plans that get lost in translation.

The threat of obsolescence

A society that does not appreciate the effort that goes into your output is a society that will not hesitate to automate you in a heartbeat. This goes for artists, but it also goes for everyone: even coders, even software developers. If the C-level doesn't see what's so difficult about your job, you are skating on the same thin ice as everyone else.

Artists are quick to mock AI-generated images for their silly hands and uncanny valleys because they find comfort in the things the model cannot do. That only they can do. Because it means their years of mastering art were not for nothing. That AI is not the career-ender that can do the same thing as them, but better and faster.

Unfortunately, with training, AI is good at getting better and faster. Artists are already being laid off in favor of shortcutting to AI generation. And some people are going as far as publishing children's books made entirely of AI images. Here's what popular children's book artist Anoosha Syed had to say about it:

In [3]:
display(HTML('<center><blockquote class="twitter-tweet"><p lang="en" dir="ltr">As a children’s author/illustrator, it is saddening to see these books bc, apart from the ethics of AI and stolen artwork, kids deserve better!!!<br><br>I’m tired of people who see kidlit as an easy get-rich-quick scheme and putting in the absolute minimal effort into their books <a href="https://t.co/I5WnlTiZgS">https://t.co/I5WnlTiZgS</a></p>&mdash; anoosha syed (@foxville_art) <a href="https://twitter.com/foxville_art/status/1602135206974345216?ref_src=twsrc%5Etfw">December 12, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></center>'))

As a children’s author/illustrator, it is saddening to see these books bc, apart from the ethics of AI and stolen artwork, kids deserve better!!!

I’m tired of people who see kidlit as an easy get-rich-quick scheme and putting in the absolute minimal effort into their books https://t.co/I5WnlTiZgS

— anoosha syed (@foxville_art) December 12, 2022

In the eyes of the artist, AI is a synonym for enemy. It doesn't matter to the big company whether its promotional material was made by a model or a human; if the result is acceptable and cheaper, then the process doesn't matter. When most of the world is ambivalent toward your plight because, unfortunately, the AI is just more efficient, it's hard to stay open-minded toward other well-meaning AI tools that actually want to empower your craft rather than replace it.

Artstation, a huge art portfolio platform, is a popular place not only to post art but also to scrape art from. Many artists protested in late 2022 after realizing that the site would not ban AI art submissions, spam-posting the site with "No AI" images for weeks. The protest was a tangible sign of the vitriol many artists arrived at once they realized how catastrophically disruptive AI could be to creatives.

image.png

The act of theft

While AI-generated images started out as remarkable, silly concepts for memes, people have since caught wind of how to finetune these models. Though the base models are primarily trained on large corpuses of realistic photos, people started exploring how to make them emulate specific, niche art styles, from something as broad as anime to something as specific as a single artist, like Ilya Kuvshinov, with devastatingly good results from just a few examples of artwork.

The side effect of using art to automate art is that the art has to come from somewhere. AI is nothing without data to learn from. And that data is most often scraped off the internet and stolen from artists.

The aforementioned Artstation is one such scraped site. Another art site, DeviantArt, tried implementing an AI generator alongside an updated feature that automatically opted in any art posted to the site for inclusion in AI training datasets. Getty Images put Stability AI in hot water with a lawsuit for scraping its images without consent. In all of these cases, the scraping was non-consensual, or underhandedly made to look consensual.

Other artists were specifically targeted with ill intent for speaking out about their dislike of AI. The bigger the artist who spoke against it, the more people took it as a mean-spirited challenge to make models out of their work. People started creating style-specific models based on an existing artist's works without even attributing the artist in their model. Some big names were appalled to learn that including their username in a prompt would generate art similar to theirs.

A big example of this is Deb JJ Lee, who only learned that her art had been used for DreamBooth training after a fan informed her that a generative model mimicking her work was getting popular on an AI subreddit.

In [4]:
display(HTML('<center><blockquote class="twitter-tweet"><p lang="en" dir="ltr">this….this is fucking sad. I don’t know what to do <a href="https://t.co/X7VL9TpPkU">pic.twitter.com/X7VL9TpPkU</a></p>&mdash; Deb JJ Lee (@jdebbiel) <a href="https://twitter.com/jdebbiel/status/1601663197031075840?ref_src=twsrc%5Etfw">December 10, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></center>'))

this….this is fucking sad. I don’t know what to do pic.twitter.com/X7VL9TpPkU

— Deb JJ Lee (@jdebbiel) December 10, 2022

The feeling of helplessness

With all of these incidents combined, you can start to see why artists, who already have a hard time fighting for fair pay and protection from art thieves, find it hard to believe AI can be good for them.

Even I, as a person studying data science, who understands how AI could be good for an artist, find it hard to believe sometimes.

The Problem

So artists hate AI. What now?

While we won't know how AI and art will develop in the coming years (whether the two will mix like water and salt, water and oil, or water and rubidium) -- we have to address the situation as it presently stands. And presently, the most vulnerable player is the artist.

Artists need ways to defend themselves while the laws that are supposed to defend them have yet to pass. Many artists have protested, many companies have either embraced or rebuked AI, and some people have devised preventative and curative tools to handle it.

My problem statement is simple:

Are the tools that will help artists protect their livelihoods against malicious AI applications actually effective?

The Gameplan

I'm fighting in the war on artists, on the side of the artists!

Helpful tools fall into two categories, AI Art Detection and Original Art Protection, both of which I will be exploring for the rest of this report. I will be using these tools to see how they fare against Textual Inversion, a task performed with Stable Diffusion.

If these tools are as robust as they purport, then they should be able to detect the artificial images in my dataset and protect the original images in my dataset, respectively, with ease. Tools like these will be necessary for artists to maintain their livelihoods even after laws are passed to protect them, so it is a point of interest to me that these tools, while still nascent, are not just snake oil.

methodology.png

Using a mini dataset of my own artwork, I will be protecting and detecting these images and seeing how well they hold up. Below is the process I'm taking and how the pieces fit together.

We will:

  1. Do textual inversion on Original Art to get Artificial Art
  2. Use Mist on Original Art to get Misted Art
  3. Do textual inversion on Misted Art to get Artificial Misted Art
  4. Use AI detectors on Original, Artificial, Misted, and Artificial Misted art.

For evaluation, we will:

  1. Look at generated examples and compare them subjectively
  2. Infer using the AI detectors and check their accuracy

Before executing this, I'll explain the big concepts in detail.

Textual Inversion with Stable Diffusion

For this project, I will be testing Textual Inversion using Stable Diffusion on a set of my own original art, and on the same art after it underwent art protection. The code I use for Textual Inversion is taken from HuggingFace's official Colab Notebook guide and has been integrated into this notebook.

Textual Inversion is one of several methods for teaching diffusion models a specific concept with just a few examples (as few as 3-5 sample images). It works by adding a new token S* to the model's vocabulary as a learnable embedding v*, along with samples of what this token means. The sample images of the novel concept are run through the forward diffusion process, creating progressively blurrier, noisier versions of themselves, and the embedding is optimized as the model learns to reverse that noising process.

In simple terms, Textual Inversion is like learning a new hard word (token) in your vocabulary, and figuring out its meaning through the context of a few example sentences (sample images).
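For the mathematically inclined, here is a rough restatement of the objective from the Textual Inversion paper: with the generator frozen, we look for the embedding $v_*$ that minimizes the usual diffusion denoising loss over the sample images,

$$v_* = \arg\min_{v}\; \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c_\theta(y)) \rVert_2^2 \,\right]$$

where $z_t$ is the noised latent of a sample image at timestep $t$, $\epsilon_\theta$ is the frozen denoising network, and $c_\theta(y)$ encodes a prompt $y$ containing the new token $S_*$.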

https://textual-inversion.github.io/static/images/training/training.JPG

Why are we using textual inversion? It is a method sometimes used to finetune a model to a specific art style. A token can be an object or a style, depending on the phrasing of the descriptive text you use when feeding your examples to the model. Just as you can make the model learn a new kind of teapot object, you can also make it learn a particular artist's art style: S* could be a <teapot> with teapot examples, but it could just as well be <ilya-kuvshinov> with Ilya Kuvshinov art examples.

Because the model needs only a few samples to get the point, textual inversion is a terrifying prospect for any artist with even fewer than ten of their artworks floating around on the internet.

https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG

Art Protection with Mist

I'll be pre-processing my original art through an Art Protection software, Mist, to see how well it messes up the Textual Inversion of a Stable Diffusion model.

image-3.png

Mist is a recently released tool, still in early development, available for Windows and Linux. It watermarks images by turning them into adversarial examples for generative models. Mist ships with Stable Diffusion v1.4 built in, takes an image, and creates adversarial examples through an algorithm coined AdvDM, which builds them via a Monte-Carlo estimate over latent variables sampled during the diffusion model's reverse process.

In simpler terms, Mist runs the provided image through Stable Diffusion with textual inversion, figures out what the model learns when it tries to reverse the image (that is, turn noise slowly back into the image), and probabilistically edits the image to obscure the things the model is learning, making it see out-of-distribution features.
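To make that concrete, below is a minimal, untested sketch of the AdvDM idea written with the same diffusers objects this notebook sets up later (vae, unet, noise_scheduler). pixel_values, encoder_hidden_states, and the step sizes are illustrative assumptions; the real Mist implementation adds further losses and refinements. Where training does gradient descent on the denoising loss, AdvDM does projected gradient ascent on the image:

# Minimal, untested sketch of AdvDM: projected gradient *ascent* on the image.
# Assumes vae, unet, noise_scheduler as set up in this notebook;
# pixel_values, encoder_hidden_states, alpha, epsilon, num_steps are illustrative.
delta = torch.zeros_like(pixel_values, requires_grad=True)  # the perturbation
alpha, epsilon, num_steps = 1 / 255, 8 / 255, 100

for _ in range(num_steps):
    # Monte-Carlo sample of the denoising loss at a random timestep
    latents = vae.encode(pixel_values + delta).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.num_train_timesteps,
                      (latents.shape[0],), device=latents.device).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)

    # Maximize the loss w.r.t. the image, then project back into the budget
    loss.backward()
    delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-epsilon, epsilon)
    delta.grad = None

misted = (pixel_values + delta).detach()  # an image the model struggles to learn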

mist.jpg

For more about Mist:

  • This is their official page
  • This is their Github
  • This is the paper on the algorithm that created Mist, Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples by Chumeng Liang, Xiaoyu Wu, Yang Hua, Jiaru Zhang, Yiming Xue, Tao Song, Zhengui Xue, Ruhui Ma, Haibing Guan
  • And this is the technical report on Mist, Mist: Towards Improved Adversarial Examples for Diffusion Models by Chumeng Liang and Xiaoyu Wu, which is an extension of the previous paper and will be presented this coming ICML 2023.

Art Detection with umm-maybe and Illuminarty

There have been a few attempts to detect AI art using models trained on examples. I will be looking at two:

  1. umm-maybe/AI-image-detector, a pre-trained model for binary classification of whether an image is human-made or artificial (see the usage sketch after this list). The model card is viewable here, and the model will be called in the notebook. You do not need to log in to HuggingFace to use it. A caveat: this is a general AI image detector. Its creator, Matthew Maybe, trained it out of curiosity by scraping original art from traditional art subreddits (r/art, r/painting, r/learntodraw) and artificial images from AI subreddits (r/bigsleep, r/midjourney and r/stablediffusion).
  2. Illuminarty, a proprietary model for classifying how likely an artwork is to be artificial (under the hood, this is also just binary classification of human vs. artificial). This model was trained specifically to detect AI art. The model itself is not available to the public, but it is free to use through an API and on their site. I ran my original art through their model one by one and noted down the scores on this page.
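As a preview, the first detector takes one line to load and one line to call via HuggingFace's pipeline (the same call used in the detection section later; the image path here is just a placeholder):

from transformers import pipeline

# Load the umm-maybe binary classifier from the HuggingFace Hub (~331MB)
ai_detector = pipeline('image-classification', model='umm-maybe/AI-image-detector')

# Returns labels ('human' / 'artificial') with confidence scores
print(ai_detector('data/Vibes/example.png'))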

Setting Up

Our coding environment and dataset

Coding Environment

We will be executing our project in a conda environment in a Python Jupyter Notebook, primarily using PyTorch and the HuggingFace libraries accelerate, diffusers, and transformers. You will need a GPU, as accelerate requires at least one. You could run on CPU, but training would take unreasonably long, so I'd advise against it. Make sure you have enough space for the models we will be downloading (one of them being Stable Diffusion). To be safe, I would advise having around 10GB of free storage for this code, and another 20GB if you plan to download Mist and use it on your own dataset.

If you don't have Anaconda yet, you can download it for free here. You can set up a new conda environment or install the necessary libraries into your base environment. Launch Anaconda's Command Prompt or your terminal and type the following:

$ conda create --name demistified jupyter
$ conda activate demistified
$ pip install -U -qq git+https://github.com/huggingface/diffusers.git
$ pip install -qq accelerate transformers ftfy

You will need to install pytorch and torchvision too, but this command varies depending on your OS. To see which command to use, check PyTorch's website. Be sure to choose an installation whose Compute Platform targets a GPU (non-CPU); this notebook will be using CUDA.
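For example, at the time of writing, a CUDA 11.8 build can be installed with the following (verify against the site before copying, as the exact command depends on your setup):

$ pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118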

You may now launch jupyter by typing jupyter notebook in the console to create a new ipykernel notebook and code along. You should be able to run the following block and import all libraries and define all functions.

As an additional step, if you have enough space and would like to skip training the models and go straight to using them in a pipeline, clone my GitHub repository:

$ cd <working directory of your notebook>
$ git clone https://github.com/osheets/barot-ml3-indiv.git

You should now have two new folders in the same directory as the notebook you are working in, vibes-concept-output and vibes-misted-concept-output, totaling approximately 8GB.

In [5]:
# Import the necessary libraries

# For your local environment, you will not need to set up os.environ (but I will)
import os
os.environ['XDG_CACHE_HOME'] = '/home/msds2023/sbarot/.cache'
os.environ['HUGGINGFACE_HUB_CACHE'] = '/home/msds2023/sbarot/.cache'

# Standard Python libraries
import argparse
import itertools
import math
import random
from glob import glob

# For data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For model training and evaluation
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.utils.data import Dataset

# For image and text pre-processing, accelerated training, connection to stable diffusion
import PIL
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from diffusers import AutoencoderKL, DDPMScheduler, PNDMScheduler, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
from transformers import pipeline

# Hide warnings (may opt to keep)
import warnings
warnings.filterwarnings("ignore")

def image_grid(imgs, rows, cols):
    """Return a single image of imgs in a rows x column grid"""
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

def freeze_params(params):
    """Freeze all params in a model"""
    for param in params:
        param.requires_grad = False
        
# prompt templates for textual inversion training on an object
imagenet_templates_small = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    "a photo of a clean {}",
    "a photo of a dirty {}",
    "a dark photo of the {}",
    "a photo of my {}",
    "a photo of the cool {}",
    "a close-up photo of a {}",
    "a bright photo of the {}",
    "a cropped photo of a {}",
    "a photo of the {}",
    "a good photo of the {}",
    "a photo of one {}",
    "a close-up photo of the {}",
    "a rendition of the {}",
    "a photo of the clean {}",
    "a rendition of a {}",
    "a photo of a nice {}",
    "a good photo of a {}",
    "a photo of the nice {}",
    "a photo of the small {}",
    "a photo of the weird {}",
    "a photo of the large {}",
    "a photo of a cool {}",
    "a photo of a small {}",
]

# prompt templates for textual inversion training on a style
imagenet_style_templates_small = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
    "a cropped painting in the style of {}",
    "the painting in the style of {}",
    "a clean painting in the style of {}",
    "a dirty painting in the style of {}",
    "a dark painting in the style of {}",
    "a picture in the style of {}",
    "a cool painting in the style of {}",
    "a close-up painting in the style of {}",
    "a bright painting in the style of {}",
    "a cropped painting in the style of {}",
    "a good painting in the style of {}",
    "a close-up painting in the style of {}",
    "a rendition in the style of {}",
    "a nice painting in the style of {}",
    "a small painting in the style of {}",
    "a weird painting in the style of {}",
    "a large painting in the style of {}",
]

# Class for Dataset
class TextualInversionDataset(Dataset):
    def __init__(
        self,
        data_root,
        tokenizer,
        learnable_property="object",  # [object, style]
        size=512,
        repeats=100,
        interpolation="bicubic",
        flip_p=0.5,
        set="train",
        placeholder_token="*",
        center_crop=False,
    ):

        self.data_root = data_root
        self.tokenizer = tokenizer
        self.learnable_property = learnable_property
        self.size = size
        self.placeholder_token = placeholder_token
        self.center_crop = center_crop
        self.flip_p = flip_p

        self.image_paths = [os.path.join(self.data_root, file_path) for file_path in os.listdir(self.data_root)]

        self.num_images = len(self.image_paths)
        self._length = self.num_images

        if set == "train":
            self._length = self.num_images * repeats

        self.interpolation = {
            "linear": PIL.Image.LINEAR,
            "bilinear": PIL.Image.BILINEAR,
            "bicubic": PIL.Image.BICUBIC,
            "lanczos": PIL.Image.LANCZOS,
        }[interpolation]

        self.templates = imagenet_style_templates_small if learnable_property == "style" else imagenet_templates_small
        self.flip_transform = transforms.RandomHorizontalFlip(p=self.flip_p)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        example = {}
        image = Image.open(self.image_paths[i % self.num_images])

        if not image.mode == "RGB":
            image = image.convert("RGB")

        placeholder_string = self.placeholder_token
        text = random.choice(self.templates).format(placeholder_string)

        example["input_ids"] = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt",
        ).input_ids[0]

        # default to score-sde preprocessing
        img = np.array(image).astype(np.uint8)

        if self.center_crop:
            crop = min(img.shape[0], img.shape[1])
            h, w, = (
                img.shape[0],
                img.shape[1],
            )
            img = img[(h - crop) // 2 : (h + crop) // 2, (w - crop) // 2 : (w + crop) // 2]

        image = Image.fromarray(img)
        image = image.resize((self.size, self.size), resample=self.interpolation)

        image = self.flip_transform(image)
        image = np.array(image).astype(np.uint8)
        image = (image / 127.5 - 1.0).astype(np.float32)

        example["pixel_values"] = torch.from_numpy(image).permute(2, 0, 1)
        return example
    
def create_dataloader(train_batch_size=1):
    """Return a shuffled dataloader for a given train_datast"""
    return torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)


logger = get_logger(__name__)

def save_progress(text_encoder, placeholder_token_id, accelerator, save_path):
    """Checkpoints and saves latest learned embeddings"""
    logger.info("Saving embeddings")
    learned_embeds = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[placeholder_token_id]
    learned_embeds_dict = {placeholder_token: learned_embeds.detach().cpu()}
    torch.save(learned_embeds_dict, save_path)

def training_function(text_encoder, vae, unet):
    """Textual Inversion training function that needs to be passed to the accelerator"""
    train_batch_size = hyperparameters["train_batch_size"]
    gradient_accumulation_steps = hyperparameters["gradient_accumulation_steps"]
    learning_rate = hyperparameters["learning_rate"]
    max_train_steps = hyperparameters["max_train_steps"]
    output_dir = hyperparameters["output_dir"]
    gradient_checkpointing = hyperparameters["gradient_checkpointing"]

    accelerator = Accelerator(
        gradient_accumulation_steps=gradient_accumulation_steps,
        mixed_precision=hyperparameters["mixed_precision"]
    )

    if gradient_checkpointing:
        text_encoder.gradient_checkpointing_enable()
        unet.enable_gradient_checkpointing()

    train_dataloader = create_dataloader(train_batch_size)

    if hyperparameters["scale_lr"]:
        learning_rate = (
            learning_rate * gradient_accumulation_steps * train_batch_size * accelerator.num_processes
        )

    # Initialize the optimizer
    optimizer = torch.optim.AdamW(
        text_encoder.get_input_embeddings().parameters(),  # only optimize the embeddings
        lr=learning_rate,
    )

    text_encoder, optimizer, train_dataloader = accelerator.prepare(
        text_encoder, optimizer, train_dataloader
    )

    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    # Move vae and unet to device
    vae.to(accelerator.device, dtype=weight_dtype)
    unet.to(accelerator.device, dtype=weight_dtype)

    # Keep vae in eval mode as we don't train it
    vae.eval()
    # Keep unet in train mode to enable gradient checkpointing
    unet.train()

    
    # We need to recalculate our total training steps as the size of the training dataloader may have changed.
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / gradient_accumulation_steps)
    num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)

    # Train!
    total_batch_size = train_batch_size * accelerator.num_processes * gradient_accumulation_steps

    logger.info("***** Running training *****")
    logger.info(f"  Num examples = {len(train_dataset)}")
    logger.info(f"  Instantaneous batch size per device = {train_batch_size}")
    logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
    logger.info(f"  Gradient Accumulation steps = {gradient_accumulation_steps}")
    logger.info(f"  Total optimization steps = {max_train_steps}")
    # Only show the progress bar once on each machine.
    progress_bar = tqdm(range(max_train_steps), disable=not accelerator.is_local_main_process)
    progress_bar.set_description("Steps")
    global_step = 0

    for epoch in range(num_train_epochs):
        text_encoder.train()
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(text_encoder):
                # Convert images to latent space
                latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample().detach()
                latents = latents * 0.18215

                # Sample noise that we'll add to the latents
                noise = torch.randn_like(latents)
                bsz = latents.shape[0]
                # Sample a random timestep for each image
                timesteps = torch.randint(0, noise_scheduler.num_train_timesteps, (bsz,), device=latents.device).long()

                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

                # Get the text embedding for conditioning
                encoder_hidden_states = text_encoder(batch["input_ids"])[0]

                # Predict the noise residual
                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states.to(weight_dtype)).sample

                 # Get the target for loss depending on the prediction type
                if noise_scheduler.config.prediction_type == "epsilon":
                    target = noise
                elif noise_scheduler.config.prediction_type == "v_prediction":
                    target = noise_scheduler.get_velocity(latents, noise, timesteps)
                else:
                    raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")

                loss = F.mse_loss(noise_pred, target, reduction="none").mean([1, 2, 3]).mean()
                accelerator.backward(loss)

                # Zero out the gradients for all token embeddings except the newly added
                # embeddings for the concept, as we only want to optimize the concept embeddings
                if accelerator.num_processes > 1:
                    grads = text_encoder.module.get_input_embeddings().weight.grad
                else:
                    grads = text_encoder.get_input_embeddings().weight.grad
                # Get the index for tokens that we want to zero the grads for
                index_grads_to_zero = torch.arange(len(tokenizer)) != placeholder_token_id
                grads.data[index_grads_to_zero, :] = grads.data[index_grads_to_zero, :].fill_(0)

                optimizer.step()
                optimizer.zero_grad()

            # Checks if the accelerator has performed an optimization step behind the scenes
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1
                if global_step % hyperparameters["save_steps"] == 0:
                    save_path = os.path.join(output_dir, f"learned_embeds-step-{global_step}.bin")
                    save_progress(text_encoder, placeholder_token_id, accelerator, save_path)

            logs = {"loss": loss.detach().item()}
            progress_bar.set_postfix(**logs)

            if global_step >= max_train_steps:
                break

        accelerator.wait_for_everyone()


    # Create the pipeline using using the trained modules and save it.
    if accelerator.is_main_process:
        pipeline = StableDiffusionPipeline.from_pretrained(
            pretrained_model_name_or_path,
            text_encoder=accelerator.unwrap_model(text_encoder),
            tokenizer=tokenizer,
            vae=vae,
            unet=unet,
        )
        pipeline.save_pretrained(output_dir)
        # Also save the newly trained embeddings
        save_path = os.path.join(output_dir, f"learned_embeds.bin")
        save_progress(text_encoder, placeholder_token_id, accelerator, save_path)

Dataset

This is a digital image I drew in 2020. Each of the 8 squares represents a concept I associate with one of my friends, also known as their vibe. All of them are artists, some hobbyists and some professionals! Because the art is mine, I give myself consent to use it for this project. Do not use someone else's artwork without consent if you plan to imitate this project and share it.

Textual inversion requires very few samples, so 8 should suffice for training. I picked this set for its consistent rendering style across very different subject matters, to see how the model performs on non-repetitive examples drawn with a consistent technique. You can find the images in data/Vibes.

Friend_vibes.png

| friend | vibe |
| - | - |
| el | yellow brush strokes on a canvas |
| cyrus | the cyan cartridge of a laser printer |
| madeline | a dim room illuminated by the light of a pink sky |
| kuggs | a misty forest filled with coniferous trees |
| yana | a glass of dark red wine |
| margot | a warm purple sunset in an empty field |
| juice | golden syrup poured on top of a stack of pancakes |
| jara | a rocky sea shore with crashing waves on an overcast sky |
In [6]:
# Load the images
original_images = sorted(glob('data/Vibes/*.png'))
original_image_files = []
for original_image in original_images:
    original_image_files.append(Image.open(original_image).resize((256, 256)))

image_grid(original_image_files, 1, len(original_image_files))
Out[6]:

Original Art to Artificial Art

Copy someone's style using textual inversion

image.png

Training

The following section shows how to train the model using textual inversion. This process took about 7 hours on one GPU. The trained model is already available in vibes-concept-output, so you may skip to the results and load that instead of starting from scratch.

Setting up Hyperparameters. Our model will train with a learning rate of 0.0005 for 2000 steps at fp16 mixed precision (change this to "no" if you are using a CPU). The model will checkpoint every 250 steps and train on 4 images at a time. The final model will save to vibes-concept-output. If your GPU runs out of memory even after emptying the cache, try reducing train_batch_size and increasing max_train_steps.

The following codeblock will be used to set the variables for training. Here are a few key ones that you need to understand:

| variable | description | value |
| - | - | - |
| pretrained_model_name_or_path | the name of the image generation model on HuggingFace | stabilityai/stable-diffusion-2 |
| what_to_teach | the type of concept (object or style) -- changes the prompt phrasing | style |
| placeholder_token | the name of your novel concept, S*. by convention, use angular brackets and dashes to avoid overwriting existing embeddings | <stevie-barot> |
| initializer_token | a word that summarizes what your new concept is, used as a starting point | painting |
| save_path | the directory of your original image files | ./data/Vibes |

In [7]:
pretrained_model_name_or_path = "stabilityai/stable-diffusion-2"
what_to_teach = "style"
placeholder_token = "<stevie-barot>"
initializer_token = "painting" 
save_path = './data/Vibes'

hyperparameters = {
    "learning_rate": 5e-04,
    "scale_lr": True,
    "max_train_steps": 2000,
    "save_steps": 250,
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "mixed_precision": "fp16",
    "seed": 42,
    "output_dir": "vibes-concept-output"
}

# !mkdir -p vibes-concept-output

Setting up Tokenizer. We load a tokenizer using CLIPTokenizer and add our placeholder_token, <stevie-barot>. This block adds it to the tokenizer and makes sure that our placeholder maps to exactly one new token (no spaces). After that, we store the IDs for the initializer_token, painting, and the placeholder_token, <stevie-barot>.

In [8]:
# ~1.5MB
# Load the tokenizer and add the placeholder token as a additional special token.
tokenizer = CLIPTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="tokenizer",
)

# Add the placeholder token in tokenizer
num_added_tokens = tokenizer.add_tokens(placeholder_token)
if num_added_tokens == 0:
    raise ValueError(
        f"The tokenizer already contains the token {placeholder_token}. Please pass a different"
        " `placeholder_token` that is not already in the tokenizer."
    )
    
#Get token ids for our placeholder and initializer token
# This code block will complain if initializer string is not a single token

# Convert the initializer_token, placeholder_token to ids
token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
# Check if initializer_token is a single token or a sequence of tokens
if len(token_ids) > 1:
    raise ValueError("The initializer token must be a single token.")

initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

Setting up Model and Text and Image Encoders. Stable Diffusion uses CLIPTextModel, AutoencoderKL, and UNet2DConditionModel. We download these from HuggingFace using the weights trained for Stable Diffusion; this will also download a portion of Stable Diffusion itself.

In [9]:
#Load the Stable Diffusion model, ~4.5GB
# Load models and create wrapper for stable diffusion

# pipeline = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path)
# del pipeline

text_encoder = CLIPTextModel.from_pretrained(
    pretrained_model_name_or_path, subfolder="text_encoder"
)
vae = AutoencoderKL.from_pretrained(
    pretrained_model_name_or_path, subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_name_or_path, subfolder="unet"
)

Add new embedding. The text encoder has to resize its embedding table to match our extended tokenizer. We then initialize the embedding at placeholder_token_id with a copy of the embedding at initializer_token_id, so the new token starts from the meaning of "painting".

In [10]:
text_encoder.resize_token_embeddings(len(tokenizer))
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]

Freeze model for feature extraction. All parameters are frozen except the token embeddings in the text encoder, which are the only weights textual inversion trains.

In [11]:
# Freeze vae and unet
freeze_params(vae.parameters())
freeze_params(unet.parameters())
# Freeze all parameters except for the token embeddings in text encoder
params_to_freeze = itertools.chain(
    text_encoder.text_model.encoder.parameters(),
    text_encoder.text_model.final_layer_norm.parameters(),
    text_encoder.text_model.embeddings.position_embedding.parameters(),
)
freeze_params(params_to_freeze)

Prepare dataset. We initialize the dataset class with our tokenizer, set to repeat each image 100 times for training.

In [12]:
# Prepare dataset
train_dataset = TextualInversionDataset(
      data_root=save_path,
      tokenizer=tokenizer,
      size=vae.config.sample_size,
      placeholder_token=placeholder_token,
      repeats=100,
      learnable_property=what_to_teach,
      center_crop=False,
      set="train")

Setting up scheduler. We use a DDPM scheduler, which noises the latents at randomly sampled timesteps during training (the exact forward process is shown after the code cell below).

In [13]:
# 345B
noise_scheduler = DDPMScheduler.from_config(pretrained_model_name_or_path, subfolder="scheduler")
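Concretely, the scheduler's add_noise method implements the standard DDPM forward process: given a clean latent $z_0$ and a timestep $t$, it returns

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t$ is the cumulative product of the noise schedule, so larger $t$ means a blurrier, noisier latent.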

Train with accelerate. We launch accelerate with our training_function defined at the beginning, on one GPU. Change num_processes if you have more than one GPU. 2000 steps took about 7 hours. You are free to skip this and load the already trained model, which is available in vibes-concept-output.

In [14]:
# import accelerate

# accelerate.notebook_launcher(training_function, num_processes=1, args=(text_encoder, vae, unet))

# for param in itertools.chain(unet.parameters(), text_encoder.parameters()):
#     if param.grad is not None:
#         del param.grad  # free some memory
#     torch.cuda.empty_cache()

Generated Image Results

We now load the trained model through a StableDiffusionPipeline pretrained from vibes-concept-output. We need a scheduler too; in this case, we use DPMSolverMultistepScheduler.

In [15]:
# Set up the pipeline 
from diffusers import DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "vibes-concept-output",
    scheduler=DPMSolverMultistepScheduler.from_pretrained("vibes-concept-output", subfolder="scheduler"),
    torch_dtype=torch.float16,
).to("cuda")

Setting up image prompts. We can now use our token, <stevie-barot>, in our Stable Diffusion prompts. Because it is a style, we phrase our prompts with "in the style of...". I did my best to write prompts that describe the initial feelings of the art I fed the model.

In [16]:
prompts =  ["an image in the style of <stevie-barot>",
         "a misty forest with coniferous trees in the style of <stevie-barot>",
         "a warm purple sunset in the style of <stevie-barot>",
         "a plate of stacked pancakes with maple syrup in the style of <stevie-barot>",
         "a glass of red wine in the style of <stevie-barot>",
         "a room with the shadow of a window with a pink sunset in the style of <stevie-barot>",
         "a rocky sea shore with high waves in the style of <stevie-barot>",
         "a laser printer in the style of <stevie-barot>",
         "a paint brush painting a canvas with yellow paint in the style of <stevie-barot>"
        ]

Running diffusion pipeline. We can now call our model multiple times to output as many image samples as we want for our prompts. Because diffusion models are probabilistic and hard to replicate, I've commented out the block below. The model will generate very different images for you than it did for me, but go ahead and try it if you want to overwrite the images already saved in data/Vibes_generated.

In [17]:
# #Run the Stable Diffusion pipeline

# num_samples = 4
# num_rows = 1

# all_prompt_images = {}

# for prompt in prompts: 
#     all_images = []
#     for _ in range(num_rows):
#         images = pipe([prompt] * num_samples, num_inference_steps=30, guidance_scale=7.5).images
#         all_images.extend(images)
#     all_prompt_images[prompt] = all_images
    
# !mkdir -p data/Vibes_generated

# for p, prompt_images in enumerate(list(all_prompt_images.values())):
#     for i, im in enumerate(prompt_images):
#         im.save(f'data/Vibes_generated/{p}_{i}.png')

Show results. The images with more general landscapes were, I would say, imitated best. The style transfer is most apparent in Kuggs and Margot's vibes (the forest and the sunset). Images with more objects in them, like Juice and Cyrus's (the pancakes and the printer), ended up more photorealistic, without any resemblance to the original images.

In [18]:
prompt_dict = {i: p for i, p in enumerate(prompts)}
all_prompt_images = {p: [] for p in prompts}

for i in glob('data/Vibes_generated/*.png'):
    p = int(os.path.basename(i)[0])
    all_prompt_images[prompt_dict[p]].append(Image.open(i))

original_dict = {i: j for i, j in zip([7, 8, 6, 3, 1, 5, 2, 4], original_images)}
original_dict = dict(zip([prompt_dict[o] for o in original_dict], list(original_dict.values())))

for prompt in list(all_prompt_images.keys())[1:]:
    fig, ax = plt.subplots(1, 2, figsize=(10,5), dpi=100)
    ax[0].imshow(Image.open(original_dict[prompt]))
    ax[0].set_xticks([])
    ax[0].set_yticks([])
    ax[1].imshow(image_grid(all_prompt_images[prompt], 2, 2))
    ax[1].set_xticks([])
    ax[1].set_yticks([])
    fig.suptitle(f'"{prompt}"')
    fig.tight_layout()
    plt.show()

Original Art to Misted Art

Processing images to become untrainable on stable diffusion

image.png

For this dataset, you can find the already misted artworks in data/Vibes_misted.

A version of Mist is available for Windows and Linux. The code is also available for download on GitHub, but I would warn that the storage requirements to download and run Mist are quite high. The download is around 8GB in total and, when unzipped, takes up another 11GB of disk space. Not just that: Mist will run into errors or straight up not run if your computer does not have a GPU it can access, and it will take up a lot of GPU memory. Mist takes about 1-3 minutes to process a single image, even one resized to 256x256px. Needless to say, it is not yet optimized.

I did the work of misting my original art on my own PC, as it is really not efficient to execute from a Jupyter Notebook. If you would like to use Mist, here is a page of Mist's documentation showing the steps to manually download it from HuggingFace or to set up a conda environment after cloning from GitHub. I would advise the former, because the direct download comes with a properly set-up environment. From there, you can launch an executable that opens a temporary Gradio space, which lets you watermark as many images as your PC can handle.

In [19]:
original_misted_images = sorted(glob('data/Vibes_misted/*.png'))
original_misted_image_files = []
for original_image in original_misted_images:
    original_misted_image_files.append(Image.open(original_image).resize((256, 256)))
print('original images')
display(image_grid(original_image_files, 1, len(original_image_files)))
print('misted images')
display(image_grid(original_misted_image_files, 1, len(original_misted_image_files)))
original images
misted images

While not super visible, you can see, especially on whiter textures, that a low-opacity texture has now been added to the drawing.
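If you'd like to see the perturbation itself, here's a quick optional sketch that amplifies the pixel-wise difference between one original image and its misted counterpart (assuming the two folders share filenames so the sorted lists line up; the index and the x8 factor are arbitrary):

# Amplify the difference between one original image and its misted version
orig = np.array(original_image_files[0].convert('RGB'), dtype=np.int16)
mist = np.array(original_misted_image_files[0].convert('RGB'), dtype=np.int16)
diff = np.abs(orig - mist)

plt.figure(figsize=(4, 4), dpi=100)
plt.imshow((diff * 8).clip(0, 255).astype(np.uint8))  # x8 for visibility
plt.axis('off')
plt.title('Amplified Mist perturbation')
plt.show()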

Misted Art to Misted Artificial Art

Messing up textual inversion training using misted images

image.png

Training

Now that we have our misted images, let's see how they change the model with virtually the same training pipeline. The only parameters changed here are the placeholder_token, now named <stevie-misted>, the save_path of the misted images, ./data/Vibes_misted, and the output directory for the final model, vibes-misted-concept-output.

In [20]:
pretrained_model_name_or_path = "stabilityai/stable-diffusion-2"
what_to_teach = "style" # object or style
placeholder_token = "<stevie-misted>" # used to represent your new concept
initializer_token = "painting" #word that can summarise what your new concept is, to be used as a starting point
save_path = './data/Vibes_misted'

hyperparameters = {
    "learning_rate": 5e-04,
    "scale_lr": True,
    "max_train_steps": 4000,
    "save_steps": 250,
    "train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "mixed_precision": "fp16",
    "seed": 42,
    "output_dir": "vibes-misted-concept-output"
}

# !mkdir -p vibes-misted-concept-output

The block below repeats the same steps as the textual inversion we ran on our original art. The tokenizer is initialized with our placeholder, the model and the text and image encoders are loaded, the text encoder is resized, the model parameters are frozen, the dataset is initialized, and the scheduler is defined.

In [21]:
# ~1.5MB
# Load the tokenizer and add the placeholder token as a additional special token.
tokenizer = CLIPTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="tokenizer",
)

# Add the placeholder token in tokenizer
num_added_tokens = tokenizer.add_tokens(placeholder_token)
if num_added_tokens == 0:
    raise ValueError(
        f"The tokenizer already contains the token {placeholder_token}. Please pass a different"
        " `placeholder_token` that is not already in the tokenizer."
    )
    
#Get token ids for our placeholder and initializer token
# This code block will complain if initializer string is not a single token

# Convert the initializer_token, placeholder_token to ids
token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
# Check if initializer_token is a single token or a sequence of tokens
if len(token_ids) > 1:
    raise ValueError("The initializer token must be a single token.")

initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

#Load the Stable Diffusion model, ~4.5GB
# Load models and create wrapper for stable diffusion

# pipeline = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path)
# del pipeline

text_encoder = CLIPTextModel.from_pretrained(
    pretrained_model_name_or_path, subfolder="text_encoder"
)
vae = AutoencoderKL.from_pretrained(
    pretrained_model_name_or_path, subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_name_or_path, subfolder="unet"
)

text_encoder.resize_token_embeddings(len(tokenizer))
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]

# Freeze vae and unet
freeze_params(vae.parameters())
freeze_params(unet.parameters())
# Freeze all parameters except for the token embeddings in text encoder
params_to_freeze = itertools.chain(
    text_encoder.text_model.encoder.parameters(),
    text_encoder.text_model.final_layer_norm.parameters(),
    text_encoder.text_model.embeddings.position_embedding.parameters(),
)
freeze_params(params_to_freeze)

train_dataset = TextualInversionDataset(
      data_root=save_path,
      tokenizer=tokenizer,
      size=vae.config.sample_size,
      placeholder_token=placeholder_token,
      repeats=100,
      learnable_property=what_to_teach, #Option selected above between object and style
      center_crop=False,
      set="train")

# 345B
noise_scheduler = DDPMScheduler.from_config(pretrained_model_name_or_path, subfolder="scheduler")

Because I ran out of GPU memory for this, I trained this model for 4000 steps with a batch_size of 2; the previous model trained for 2000 steps with a batch_size of 4. This training process similarly took around 7 hours on 1 GPU. You may run the code below to start this training again, or skip to the results and load the already trained model at vibes-misted-concept-output.
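An untested alternative if you hit the same memory limits: gradient accumulation can keep the effective batch size at 4 without holding 4 images in memory at once, since the Accelerator set up in training_function already supports it:

# Untested sketch: simulate the original effective batch size of 4 on a
# smaller GPU by accumulating gradients over 2 steps of batch size 2
hyperparameters.update({
    "train_batch_size": 2,
    "gradient_accumulation_steps": 2,  # 2 x 2 = effective batch of 4
    "max_train_steps": 2000,
})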

In [22]:
# # may need to reduce batch_size and increase max_train_steps to allocate space for GPU
# import accelerate

# accelerate.notebook_launcher(training_function, num_processes=1, args=(text_encoder, vae, unet))

# for param in itertools.chain(unet.parameters(), text_encoder.parameters()):
#     if param.grad is not None:
#         del param.grad  # free some memory
#     torch.cuda.empty_cache()

Generated Image Results

The model is read from vibes-misted-concept-output. We use the same prompts, but change the token to <stevie-misted> and get 4 samples for each prompt.

In [25]:
# Set up the pipeline -- may need to restart to reallocate cuda
from diffusers import DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "vibes-misted-concept-output",
    scheduler=DPMSolverMultistepScheduler.from_pretrained(
        "vibes-misted-concept-output", subfolder="scheduler"),
    torch_dtype=torch.float16,
).to("cuda")
In [26]:
misted_prompts = ["an image in the style of <stevie-misted>",
                  "a misty forest with coniferous trees in the style of <stevie-misted>",
                  "a warm purple sunset in the style of <stevie-misted>",
                  "a plate of stacked pancakes with maple syrup in the style of <stevie-misted>",
                  "a glass of red wine in the style of <stevie-misted>",
                  "a room with the shadow of a window with a pink sunset in the style of <stevie-misted>",
                  "a rocky sea shore with high waves in the style of <stevie-misted>",
                  "a laser printer in the style of <stevie-misted>",
                  "a paint brush painting a canvas with yellow paint in the style of <stevie-misted>"
                  ]

Because the pipeline is probabilistic, running the code below will generate images dissimilar to mine. You are free to run it for your own examples and overwrite the images already in data/Vibes_misted_generated if you want, or skip this and proceed to reading the already processed images.

In [27]:
# #Run the Stable Diffusion pipeline
# num_samples = 4
# num_rows = 1

# all_misted_prompt_images = {}

# for prompt in misted_prompts: 
#     all_images = []
#     for _ in range(num_rows):
#         images = pipe([prompt] * num_samples, num_inference_steps=30, guidance_scale=7.5).images
#         all_images.extend(images)
#     all_misted_prompt_images[prompt] = all_images
    
# !mkdir -p data/Vibes_misted_generated

# for p, prompt_images in enumerate(list(all_misted_prompt_images.values())):
#     for i, im in enumerate(prompt_images):
#         im.save(f'data/Vibes_misted_generated/{p}_{i}.png')

The results are interesting. The only image that seemed unaffected is Kuggs's (the forest), which still produced cohesive results. Images like Yana and Margot's (the wine and the sunset) have visibly disruptive patterns. While not all results are senseless beyond comprehension, the quality of the images has been noticeably impaired compared to the original model. Misting does have an effect on the model.

In [28]:
# Note: use a new name here; reassigning original_images would silently break
# the detection section below, which still needs the unmisted art
misted_display_images = glob('data/Vibes_misted/*.png')
misted_prompt_dict = {i: p for i, p in enumerate(misted_prompts)}
all_misted_prompt_images = {p: [] for p in misted_prompts}

for i in glob('data/Vibes_misted_generated/*.png'):
    p = int(os.path.basename(i)[0])
    all_misted_prompt_images[misted_prompt_dict[p]].append(Image.open(i))

original_misted_dict = {i: j for i, j in zip([7, 1, 5, 6, 3, 2, 4, 8], misted_display_images)}
original_misted_dict = dict(zip([misted_prompt_dict[o] for o in original_misted_dict],
                                list(original_misted_dict.values())))

for prompt in list(all_misted_prompt_images.keys())[1:]:
    fig, ax = plt.subplots(1, 2, figsize=(10,5), dpi=100)
    ax[0].imshow(Image.open(original_misted_dict[prompt]))
    ax[0].set_xticks([])
    ax[0].set_yticks([])
    ax[1].imshow(image_grid(all_misted_prompt_images[prompt], 2, 2))
    ax[1].set_xticks([])
    ax[1].set_yticks([])
    fig.suptitle(f'"{prompt}"')
    fig.tight_layout()
    plt.show()

Art Detection

Is there a line between real and fake?

We will try detecting whether our images are real or artificial using umm-maybe's and Illuminarty's detector models. We import the AI image detector by umm-maybe using pipeline from HuggingFace in the code below. As for the Illuminarty scores: since the model is proprietary, they were collected manually by inputting each image into their site, which can be found here.

In [29]:
# 331MB
ai_detector = pipeline('image-classification', model='umm-maybe/AI-image-detector')
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
In [30]:
accuracies = {}
original_scores = []

# Detecting Original Art
for original_image in original_images:
    original_score = ai_detector(original_image)
    prediction = original_score[0]['label']
    prediction_score = original_score[0]['score']
    original_scores.append([original_image, prediction, prediction_score])
    
original_df = pd.DataFrame(original_scores, columns=['filepath', 'prediction', 'score'])
original_df['score'] = (original_df['score'] * 100).round(1)
original_df['illuminarty prediction'] = ['human'] * 8
# Illuminarty reports a % chance of being artificial; 100 - x is its confidence in 'human'.
original_df['illuminarty score'] = [100-x for x in [0.2, 0.0, 0.5, 1.5, 0.0, 0.1, 1.0, 0.2]]

original_preds = len(original_df[original_df['prediction'] == 'human'])
illuminarty_original_preds = len(original_df[original_df['illuminarty prediction'] == 'human'])
original_df['type'] = ['Original Art'] * len(original_df)

correct = (original_preds / len(original_df)) * 100
ill_correct = (illuminarty_original_preds / len(original_df)) * 100

# print(f"{correct:.2f}% Correct")
# print(f"{ill_correct:.2f}% Illuminarty Correct")

accuracies['Original Art umm-maybe'] = correct
accuracies['Original Art Illuminarty'] = ill_correct

# Detecting Artificial Art
artificial_images = glob('data/Vibes_generated/*.png')
artificial_scores = []

for artificial_image in artificial_images:
    artificial_score = ai_detector(artificial_image)
    prediction = artificial_score[0]['label']
    prediction_score = artificial_score[0]['score']
    artificial_scores.append([artificial_image, prediction, prediction_score])

artificial_df = pd.DataFrame(artificial_scores, columns=['filepath', 'prediction', 'score'])
artificial_df['score'] = (artificial_df['score'] * 100).round(1)  # keep the percent scale used for original_df

artificial_preds = len(artificial_df[artificial_df['prediction'] == 'artificial'])

illuminarty_artificial_score = [86.3, 93.0, 95.1, 89.9,
                                58.2, 90.7, 64.5, 84.5,
                                88.8, 98.4, 89.3, 87.5,
                                46.7, 37.1, 87.3, 82.5,
                                71.1, 25.9, 84.1, 60.2,
                                97.1, 73.5, 40.9, 77.8,
                                42.3, 68.2, 64.4, 19.2,
                                48.7, 91.88, 75.3, 99.2,
                                76.3, 80.2, 91.8, 40.7]

illuminarty_prediction = ['artificial' if i > 50.0 else 'human' for i in illuminarty_artificial_score]
illuminarty_score = [i if i > 50.0 else 100-i for i in illuminarty_artificial_score]

artificial_df['illuminarty prediction'] = illuminarty_prediction
artificial_df['illuminarty score'] = illuminarty_score
artificial_df['type'] = ['Artificial Art'] * len(artificial_df)

illuminarty_artificial_preds = len(artificial_df[artificial_df['illuminarty prediction'] == 'artificial'])
correct = (artificial_preds / len(artificial_df)) * 100
ill_correct = (illuminarty_artificial_preds / len(artificial_df)) * 100

accuracies['Artificial Art umm-maybe'] = correct
accuracies['Artificial Art Illuminarty'] = ill_correct

# print(f"{correct:.2f}% Correct")
# print(f"{ill_correct:.2f}% Illuminarty Correct")

# Detecting Misted Art
misted_images = glob('data/Vibes_misted/*.png')

misted_scores = []

for misted_image in misted_images:
    misted_score = ai_detector(misted_image)
    prediction = misted_score[0]['label']
    prediction_score = misted_score[0]['score']
    misted_scores.append([misted_image, prediction, prediction_score])
    
misted_df = pd.DataFrame(misted_scores, columns=['filepath', 'prediction', 'score'])
misted_df['score'] = (misted_df['score'] * 100).round(1)
misted_df['illuminarty prediction'] = ['human'] * 8
misted_df['illuminarty score'] = [100-x for x in [0.2, 0.0, 0.5, 1.5, 0.0, 0.1, 1.0, 0.2]]
misted_df['type'] = ['Misted Art'] * len(misted_df)

misted_preds = len(misted_df[misted_df['prediction'] == 'human'])
illuminarty_misted_preds = len(misted_df[misted_df['illuminarty prediction'] == 'human'])

correct = (misted_preds / len(misted_df)) * 100
ill_correct = (illuminarty_misted_preds / len(misted_df)) * 100

# print(f"{correct:.2f}% Correct")
# print(f"{ill_correct:.2f}% Illuminarty Correct")

accuracies['Misted Art umm-maybe'] = correct
accuracies['Misted Art Illuminarty'] = ill_correct

# Detecting Artificial Misted Art
artificial_misted_images = glob('data/Vibes_misted_generated/*.png')
artificial_misted_scores = []

for artificial_image in artificial_misted_images:
    artificial_score = ai_detector(artificial_image)
    prediction = artificial_score[0]['label']
    prediction_score = artificial_score[0]['score']
    artificial_misted_scores.append([artificial_image, prediction, prediction_score])

artificial_misted_df = pd.DataFrame(artificial_misted_scores, columns=['filepath', 'prediction', 'score'])
artificial_misted_df['score'] = (artificial_misted_df['score'] * 100).round(1)  # match the percent scale of the other frames

artificial_misted_preds = len(artificial_misted_df[artificial_misted_df['prediction'] == 'artificial'])

illuminarty_artificial_misted_score = [100.0, 82.7, 83.2, 65.2,
                                       92.6, 17.6, 87.0, 69.6,
                                       99.4, 97.0, 72.9, 93.2,
                                       67.9, 14.2, 26.1, 81.6,
                                       16.6, 33.6, 3.8, 29.3,
                                       91.1, 82.6, 82.9, 93.8,
                                       51.7, 29.8, 77.4, 30.4,
                                       38.9, 81.5, 97.9, 64.4,
                                       56.8, 2.5, 92.9, 24.1
                                      ]

illuminarty_misted_prediction = ['artificial' if i > 50.0 else 'human' for i in illuminarty_artificial_misted_score]
illuminarty_misted_score = [i if i > 50.0 else 100-i for i in illuminarty_artificial_misted_score]

artificial_misted_df['illuminarty prediction'] = illuminarty_misted_prediction
artificial_misted_df['illuminarty score'] = illuminarty_misted_score
artificial_misted_df['type'] = ['Artificial Misted Art'] * len(artificial_misted_df)

correct = (artificial_misted_preds / len(artificial_misted_df)) * 100
illuminarty_artificial_misted_preds = len(artificial_misted_df[artificial_misted_df['illuminarty prediction'] == 'artificial'])
ill_correct = (illuminarty_artificial_misted_preds / len(artificial_misted_df)) * 100

accuracies['Artificial Misted Art umm-maybe'] = correct
accuracies['Artificial Misted Art Illuminarty'] = ill_correct

# print(f"{correct:.2f}% Correct")
# print(f"{ill_correct:.2f}% Illuminarty Correct")
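As an aside, the four near-identical scoring blocks above could be condensed into a small helper. A sketch, assuming the same ai_detector pipeline and DataFrame layout; the detect_images name is hypothetical:

def detect_images(image_paths, detector):
    # Score each image and keep only the top-ranked label and its confidence.
    rows = []
    for path in image_paths:
        top = detector(path)[0]
        rows.append([path, top['label'], top['score']])
    return pd.DataFrame(rows, columns=['filepath', 'prediction', 'score'])

# Usage, e.g.: misted_df = detect_images(glob('data/Vibes_misted/*.png'), ai_detector)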

Misting original images does not confuse the detectors. Both detectors have 100% accuracy in identifying original and misted art as human-made. An artwork will rarely be mistaken for artificial just because it was misted.

Illuminarty performs better on artificial data. In both cases, Illuminarty correctly identified artificial art as artificial more often than umm-maybe did.

Images generated from misted images reduce AI-detection accuracy. Illuminarty's accuracy decreased for images generated by the model trained on misted images. This may be because the disrupted art produced through misting is atypical of common AI-generated artwork.

In [31]:
fig, ax = plt.subplots(figsize=(12, 6), dpi=100)
pd.DataFrame(accuracies, index=['Accuracies']).T.sort_values('Accuracies').plot.barh(
    title='AI Image Detector Accuracy per art type (in percent)',
    color='#8975F0',
    ax=ax
)
ax.set_xlim([0, 100]);

Illuminarty is superior. Looking collectively at original, artificial, misted, and misted-artificial images, Illuminarty is more accurate overall than umm-maybe.

In [32]:
fig, ax = plt.subplots(figsize=(12, 6), dpi=100)

df = pd.concat([original_df, artificial_df, misted_df, artificial_misted_df])
df['actual'] = ['human' if (i == 'Original Art') or (i == 'Misted Art') else 'artificial' for i in df['type']]

correct = len(df[df['actual'] == df['prediction']]) / len(df)
ill_correct = len(df[df['actual'] == df['illuminarty prediction']]) / len(df)

pd.DataFrame({'umm-maybe accuracy': correct * 100, 'illuminarty accuracy': ill_correct * 100},
             index=['Accuracies']).T.sort_values('Accuracies').plot.barh(title='Accuracy per method (in percent)',
                                                                         color='#8975F0',
                                                                         ax=ax
                                                                        )

ax.set_xlim([0, 100]);
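Accuracy alone hides the direction of each detector's errors. A quick cross-tabulation, reusing df and pandas from the cell above, makes them explicit:

# Confusion tables: rows are actual labels, columns are predicted labels.
print(pd.crosstab(df['actual'], df['prediction']))               # umm-maybe
print(pd.crosstab(df['actual'], df['illuminarty prediction']))   # Illuminarty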

So Does it Work?

An assessment of our tools

Based on this project, we've seen that:

Mist is relatively successful in disrupting textual inversion in Stable Diffusion

Relatively good for specific prompts. For specific prompts like the ones below, the texture of the outputs is ruined in some cases. While not completely destroyed, these images are no longer high quality or presentable. The best example is the sunset, which ends up grainy and hazy.

In [33]:
artificial_images = np.array_split(sorted(glob('data/Vibes_generated/*.png')), 9)
artificial_image_files = []
for artificial_array in artificial_images:
    artificial_array_files = []
    for artificial_image in artificial_array:
        artificial_array_files.append(Image.open(artificial_image).resize((256, 256)))
    artificial_image_files.append(artificial_array_files)
artificial_image_files = [image_grid(a, 2, 2) for a in artificial_image_files]
display(image_grid(np.array(artificial_image_files)[[7, 8, 6, 3, 1, 5, 2, 4]], 1, 8))

artificial_misted_images = np.array_split(sorted(glob('data/Vibes_misted_generated/*.png')), 9)
artificial_misted_image_files = []
for artificial_misted_array in artificial_misted_images:
    artificial_misted_array_files = []
    for artificial_misted_image in artificial_misted_array:
        artificial_misted_array_files.append(Image.open(artificial_misted_image).resize((256, 256)))
    artificial_misted_image_files.append(artificial_misted_array_files)
artificial_misted_image_files = [image_grid(a, 2, 2) for a in artificial_misted_image_files]
display(image_grid(np.array(artificial_misted_image_files)[[7, 8, 6, 3, 1, 5, 2, 4]], 1, 8))

Really good for general prompts. Mist's strength shows when the prompt is extremely broad. In this example, I used "An image in the style of...", a very general prompt. Proof of Mist's work is in how the mist-trained model can barely generate anything cohesive: the images are muddled with textures, and their contents barely form a coherent picture. At the very least, the lazy prompter will be inhibited by Mist's effects as they stand now.

In [34]:
fig, ax = plt.subplots(1, 2, figsize=(10,5), dpi=100)
ax[0].imshow(image_grid(all_prompt_images['an image in the style of <stevie-barot>'], 2, 2))
ax[0].set_title('<stevie-barot>')
ax[0].set_xticks([])
ax[0].set_yticks([])

ax[1].imshow(image_grid(all_misted_prompt_images['an image in the style of <stevie-misted>'], 2, 2))
ax[1].set_title('<stevie-misted>')
ax[1].set_xticks([])
ax[1].set_yticks([])

fig.suptitle('"an image in the style of..."')
fig.tight_layout()
plt.show()

Computationally and aesthetically expensive. Mist gets the job done, but the specs required to download and run it are not the most accessible. Its filters are also effective, but at the cost of some of the original image's fidelity. For some images this is imperceptible, but in others (especially art in very simple styles) the textural difference is very noticeable.
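That fidelity loss can be roughly quantified. A minimal sketch comparing one original with its misted counterpart via mean absolute pixel difference; the file paths here are hypothetical, and both images are assumed to share the same dimensions:

import numpy as np
from PIL import Image

# Mean absolute per-pixel difference on a 0-255 scale; higher means the
# misted copy strays further from the original.
orig = np.asarray(Image.open('data/Vibes/example.png').convert('RGB'), dtype=float)
misted = np.asarray(Image.open('data/Vibes_misted/example.png').convert('RGB'), dtype=float)
print(f'Mean absolute pixel difference: {np.abs(orig - misted).mean():.1f}')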

Illuminarty is relatively good at detecting AI art

Illuminarty has ~77% accuracy on art detection across original, artificial, misted, and artificial misted art samples.

umm-maybe is meant for realistic images. Being trained on images from AI subreddits, umm-maybe seems more attuned to the artificial jank present in the realistic images those subreddits favor. Images of pancakes and printers are accurately flagged as artificial, while the same cannot be said for Illuminarty, because --

Illuminarty is not meant for non-art images. Interestingly, the examples Illuminarty gets wrong tend to be somewhat photo-realistic. Pancakes and printers are identified as human, even though Illuminarty is keenly accurate on artificial examples that actually look like art.

In [35]:
detected_artificial_images = artificial_df[artificial_df['prediction'] == 'artificial']['filepath'].to_list()
detected_artificial_images = [Image.open(artificial_image).resize((256, 256)) for artificial_image in detected_artificial_images]
print('What umm-maybe correctly detected as artificial')
display(image_grid(detected_artificial_images, 1, 6))

detected_artificial_images = artificial_df[artificial_df['illuminarty prediction'] == 'human']['filepath'].to_list()
detected_artificial_images = [Image.open(artificial_image).resize((256, 256)) for artificial_image in detected_artificial_images]
print('What Illuminarty incorrectly detected as human')
display(image_grid(detected_artificial_images, 1, 8))
What umm-maybe correctly detected as artificial
What Illuminarty incorrectly detected as human

Conclusion

A never-ending battle

The tools that can help artists protect their livelihoods against malicious AI applications are relatively effective.

The tools we have now are not perfect, but they are getting better. And as in cryptography, code makers and code breakers will keep devising more sophisticated ways to protect and break each other's algorithms.

It is comforting to know that these tools exist as an option for artists, but only time will tell how AI and art develop, whether together or apart.

References

  • Canva. (n.d.). Free Online AI Image Generator. Retrieved from https://www.canva.com/ai-image-generator/
  • Cao, A. (n.d.). Ando - AI Copilot for Designers. Retrieved from https://www.figma.com/community/plugin/1145446664512862540/Ando---AI-Copilot-for-Designers
  • Adobe. (n.d.). Dream Bigger with Generative Fill. Retrieved from https://www.adobe.com/ph_en/products/photoshop/generative-fill.html
  • Clip Studio Paint. (2022). Clip Studio Paint will no longer implement an image generator function. Retrieved from https://www.clipstudio.net/en/news/202212/02_01/
  • Plunkett, L. (2023). Studio Denies Laying Off Artists For AI After Fans Spot Character With Six Fingers. Retrieved from https://kotaku.com/ai-art-layoff-video-game-studio-pc-midjourney-aigc-1850489333
  • Zhou, V. (2023). AI is already taking video game illustrators’ jobs in China. Retrieved from https://restofworld.org/2023/ai-image-china-video-game-layoffs/
  • Popli, N. (2022). He Used AI to Publish a Children’s Book in a Weekend. Artists Are Not Happy About It. Retrieved from https://time.com/6240569/ai-childrens-book-alice-and-sparkle-artists-unhappy/
  • javi_khoso. (2023, June 10). Spiderverse had one of the largest teams of animators to ever work on an animated movie, so I can assure you that it didn’t steal anyone’s job. THIS is ethical use of AI. It’s not stealing from anyone, and it’s making the animators life easier by eliminating repetitive tasks. [Tweet]. Retrieved from https://twitter.com/javi_khoso/status/1667463965532733440?s=20
  • foxville_art. (2022, December 12). As a children’s author/illustrator, it is saddening to see these books bc, apart from the ethics of AI and stolen artwork, kids deserve better!!! [Tweet]. Retrieved from https://twitter.com/foxville_art/status/1602135206974345216?s=20
  • jdebbiel. (2022, December 11). this….this is fucking sad. I don’t know what to do [Tweet]. Retrieved from https://twitter.com/jdebbiel/status/1601663197031075840?s=20
  • Textual-inversion fine-tuning for Stable Diffusion using d🧨ffusers. (n.d.). Retrieved from https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb
  • Mist. (n.d.). Retrieved from https://mist-project.github.io/index_en.html
  • mist-project/mist. (n.d.). Retrieved from https://github.com/mist-project/mist
  • Liang, C., Wu, X., Hua, Y., Zhang, J., Xue, Y., Song, T., … Guan, H. (2023). Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2302.04578
  • Liang, C., & Wu, X. (2023). Mist: Towards Improved Adversarial Examples for Diffusion Models. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2305.12683
  • Maybe, M. (2022). Can an AI learn to identify “AI art”? Retrieved from https://medium.com/@matthewmaybe/can-an-ai-learn-to-identify-ai-art-545d9d6af226
  • umm-maybe/AI-image-detector. (n.d.). Retrieved from https://huggingface.co/umm-maybe/AI-image-detector
  • Illuminarty. (n.d.). Retrieved from https://app.illuminarty.ai/
  • Hugging Face. (n.d.). Textual Inversion. Retrieved from https://huggingface.co/docs/diffusers/training/text_inversion
  • Mist. (n.d.). Quick Start. Retrieved from https://mist-documentation.readthedocs.io/en/latest/content/quickstart.html#installation

No AI was used in writing this report.